EXAMINING GUIDELINES FOR DEVELOPING
ACCURATE PROFICIENCY LEVEL SCORES

Kadriye Ercikan

One attempt to make scores from large-scale assessments more interpretable has been
to provide proficiency level scores to describe the meaning of student performance on 
tests. This study has examined the accuracy of Ercikan and Julian's (2002) guidelines 
for developing proficiency level scores and the classification accuracy of proficiency 
level scores from British Columbia's Foundation Skills Assessment tests. The 
guidelines were examined by comparing expected classification accuracies, based on 
these guidelines, to those estimated using a statistical procedure. The guidelines 
provided accurate expected classification accuracies to use in making decisions about 
assessment design. 

Key words: proficiency level scores, classification accuracy, assessment design,
reliability

One attempt to make scores from large-scale assessments more
interpretable has been to provide proficiency level scores to describe the
meaning of student performance on tests. Proficiency level score
reporting is widely used in national as well as provincial achievement
tests such as the School Achievement Indicators Program (SAIP) and the
British Columbia Foundation Skills Assessments (FSA). In these
assessments, performance is represented by the classification of student
performance to a number of proficiency levels determined by a standard
setting process. These proficiency levels may have a set of labels such as
Basic, Proficient, and Advanced, and descriptions of performance at each
proficiency level. Once these scores are released, typical users, educators
or policy makers do not question the accuracy of proficiency level scores.
Yet these types of scores involve errors in classifying student
performance to different levels, especially when the number of
proficiency levels and how the classifications are obtained do not match
the properties of the tests on which the scores are based.
Misclassifications of student performance to different proficiency levels
jeopardize the validity of inferences about achievement trends and the
policy decisions these assessments are intended to inform.

Given the increased use of proficiency level classifications as
important indicators of learning outcomes to describe student
performance, it is important to examine the accuracy of these
classifications. Classification accuracy refers to the accuracy of decisions
made based on test scores rather than the accuracy of the scores. This notion
of accuracy is typically interpreted as consistency of classifications based
on the same or parallel tests. Several authors have discussed and
demonstrated procedures to estimate accuracy or consistency of
classifications based on test scores (Huynh, 1976; Livingston & Lewis,
1995; Livingston & Wingersky, 1979; Subkoviak, 1976; Swaminathan,
Hambleton, & Algina, 1974; Traub, Haertel, & Shavelson, 1996; Wilcox,
1981). Previous research has shown that one factor that determines the
accuracy of classifications is the measurement precision provided by the
test (Hambleton & Slater, 1997; Livingston & Lewis, 1995; Traub &
Rowley, 1980), particularly measurement precision at cut-score points.
Specifically, measurement error near the cut-scores provides information
about the likelihood of misclassification errors, i.e., false-positive and
false-negative errors. A second factor that affects classification accuracy
is the distance between cut-scores. When the cut-scores are closer to each
other, the likelihood of false-positive and false-negative
misclassifications is higher. Higher numbers of proficiency levels
typically result in cut-scores that are closer to each other than if a smaller
number of proficiency levels were used. Therefore, the higher the
number of proficiency levels, the higher the probability that students
may be misclassified.

In practical large-scale assessment situations, procedures are
available to estimate classification accuracy for proficiency scores based
on a single test administration once the assessment results have been
determined (Huynh, 1976, 1979; Livingston & Lewis, 1995; Subkoviak,
1976). However, for most assessment purposes, and especially for
high-stakes decision-making purposes, discovering unreliable proficiency
level scores after the assessment has been completed is problematic.
Therefore, in the assessment design stage, guidelines are needed to
answer questions such as the following: (a) Given the test length and
reliability, and the desired level of classification accuracy, how many
proficiency levels can be used for reporting assessment results? (b) Given
the test length and reliability, and for a specific number of proficiency
levels, what level of classification accuracy can be expected? (c) For a
certain number of proficiency levels with an identified level of
classification accuracy, what level of reliability, or test length, is needed?

Ercikan and Julian (2002) presented guidelines to answer these
questions. The purpose of the present study is to examine the accuracy of
these guidelines by comparing the classification accuracy expected under
the guidelines with the classification accuracy estimated by a statistical
method that requires only a single test administration. Classification
accuracy is estimated for a large-scale assessment, the British Columbia
Foundation Skills Assessment (FSA), using Huynh's Beta-nomial
classification accuracy estimation procedure (Huynh, 1979), and these
estimates are compared to the classification accuracies based on the
Ercikan and Julian (2002) guidelines. In addition to providing results
regarding the accuracy of the Ercikan and Julian guidelines, the
estimation of classification accuracy for the FSA serves an additional
purpose. Because the FSA is similar to many provincial assessments in
Canada in terms of its scope and characteristics, the classification
accuracies obtained for the FSA may provide information about the kinds
of classification accuracies that may be expected from other assessments
with similar test length and measurement accuracies.

ERCIKAN AND JULIAN GUIDELINES 

Ercikan and Julian (2002) based their guidelines for test design on a
simulation study. In this study, they examined classification accuracy as
a function of three factors: measurement precision, number of
proficiency levels, and score level. They examined, separately as well as
jointly, the effects of each of these factors on classification accuracy by
varying the levels of these factors and observing the effect on
classification accuracy. They defined classification accuracy as the
agreement of classifications based on true and observed scores. The
agreement indicators po, per cent agreement across classification
categories, and Cohen's κ (Cohen, 1960) were used as measures of
agreement. The variation in measurement precision was provided by
simulating observed and true scores, using parameters from ten tests
whose reliabilities ranged from 0.70 to 0.93. The number of proficiency
levels varied between two and five, and the analyses were repeated for
two different sets of cut-scores. The results from this simulation study
can be summarized as follows: Classification accuracy is affected by
measurement precision, as would be expected, and decreases as the
number of proficiency levels increases. For a given reliability level, the
classification accuracy, as estimated by po and κ, decreased on
average by 10 per cent for an increase of one proficiency level, 20 per
cent for an increase of two proficiency levels, and 20 per cent to 30 per
cent for an increase of three proficiency levels. In addition, classification
accuracy was more sensitive to measurement precision when larger
numbers of proficiency levels were considered. In other words, the change
in classification accuracy with changes in reliability is greater when
higher numbers of proficiency levels are considered.
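
The qualitative pattern described above can be illustrated with a small simulation. The sketch below is not the design used by Ercikan and Julian (2002): it assumes a classical test theory model with normally distributed true scores and errors, and it places hypothetical cut-scores at equally spaced quantiles of the observed-score distribution. The function name and all parameter values are illustrative assumptions.

```python
import random
from statistics import NormalDist


def simulate_accuracy(reliability, n_levels, n_examinees=20000, seed=0):
    """Proportion of simulated examinees placed in the same level by
    their true score and by their observed score (an estimate of po)."""
    rng = random.Random(seed)
    # Classical test theory with unit observed-score variance:
    # true-score variance = reliability, error variance = 1 - reliability.
    true_sd = reliability ** 0.5
    error_sd = (1.0 - reliability) ** 0.5
    # Assumption: cut-scores at equally spaced quantiles of N(0, 1).
    cuts = [NormalDist().inv_cdf(k / n_levels) for k in range(1, n_levels)]

    def level(score):
        # Level index = number of cut-scores at or below the score.
        return sum(score >= c for c in cuts)

    agree = 0
    for _ in range(n_examinees):
        true = rng.gauss(0.0, true_sd)
        observed = true + rng.gauss(0.0, error_sd)
        agree += level(true) == level(observed)
    return agree / n_examinees


if __name__ == "__main__":
    for levels in (2, 3, 4, 5):
        print(levels, round(simulate_accuracy(0.85, levels), 3))
```

Under these assumptions, agreement falls as levels are added and rises with reliability, which is the direction of the effects reported in the simulation study; the magnitudes depend on the assumed distributions and cut-score placement.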

The minimum required test reliabilities presented in Ercikan and
Julian (2002) for a desired level of classification accuracy are summarized
in Table 1 for two, three, four, and five proficiency levels. These
guidelines suggest that, whereas a reliability estimate of 0.85 may be
sufficient for obtaining classification accuracy of 0.90 for two proficiency
levels, a reliability estimate of 0.95 or higher would be needed for three
proficiency levels. A classification accuracy of 0.90 would be highly
unlikely for four or more proficiency levels. To obtain a
classification accuracy level of 0.80, tests with reliabilities of at least 0.70,
0.80, and 0.95 would be needed for two, three, and four proficiency
levels respectively; if larger numbers of proficiency levels, such as four
or five, are needed, more modest classification accuracies such as 0.50 to
0.70 should be expected even with reliabilities as high as 0.90.


Table 1

Required Minimum Reliability Estimates for the Desired Classification
Accuracy for 2, 3, 4 and 5 Proficiency Levels

Desired Classification    Number of Proficiency Levels
Accuracy (po)             2       3       4             5

0.90                      0.85    0.95    Not likely    Not likely
0.80                      0.70    0.80    0.95          Not likely
0.70                      -       -       0.80          0.90
0.60                      -       -       0.70          0.75
0.50                      -       -       -             0.70
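
As an illustration of how Table 1 could be consulted at the design stage, the following sketch encodes the table values and answers the question "how many proficiency levels can this test support?" The names `TABLE_1`, `min_reliability`, and `max_levels` are hypothetical, and cells with no reported value are treated as missing rather than guessed.

```python
NOT_LIKELY = "not likely"

# Values transcribed from Table 1: desired po -> {levels: minimum reliability}.
# None marks a cell with no reported value; NOT_LIKELY marks accuracy levels
# the guidelines consider unattainable.
TABLE_1 = {
    0.90: {2: 0.85, 3: 0.95, 4: NOT_LIKELY, 5: NOT_LIKELY},
    0.80: {2: 0.70, 3: 0.80, 4: 0.95, 5: NOT_LIKELY},
    0.70: {2: None, 3: None, 4: 0.80, 5: 0.90},
    0.60: {2: None, 3: None, 4: 0.70, 5: 0.75},
    0.50: {2: None, 3: None, 4: None, 5: 0.70},
}


def min_reliability(desired_accuracy, n_levels):
    """Return the Table 1 entry for a desired po and number of levels."""
    return TABLE_1[desired_accuracy][n_levels]


def max_levels(reliability, desired_accuracy):
    """Largest number of levels whose tabled minimum reliability is met.

    Returns None when no tabled numeric entry is satisfied.
    """
    supported = [
        levels
        for levels, required in TABLE_1[desired_accuracy].items()
        if isinstance(required, float) and reliability >= required
    ]
    return max(supported) if supported else None


if __name__ == "__main__":
    print(max_levels(0.85, 0.80))  # reliability 0.85, desired po of 0.80
```

For a test with reliability 0.85, this lookup indicates that three levels can be reported at a desired accuracy of 0.80, but only two at 0.90.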


VERIFICATION OF GUIDELINES 

Using data from the FSA 2000 tests in this study, Ercikan examined the
accuracy of the guidelines presented in Ercikan and Julian (2002) by
comparing these guidelines to classification accuracy estimates obtained
using Huynh's (1979) Beta-nomial procedure.

Foundation Skills Assessment (FSA) Tests

The FSA is part of the British Columbia provincial assessments.
Performance on the FSA tests is reported in terms of three proficiency
levels: Not Yet Within Expectations, Meeting Expectations, and
Exceeding Expectations. Students who have not attained the "meets
expectations" standard are considered to be "not yet within
expectations." The top two proficiency levels were defined as follows:

Meets expectations. The level of performance at which a student meets or 
exceeds the widely held expectations for the grade on this test. With no other 
information, this is the level below which a teacher would want to know more 
about the reasons for a student's low performance. 

Exceeds expectations. The level of a student's performance that is beyond that 
at which a teacher would say the student has fully met the expectations of the 
grade on this test. Students' performance would be considered excellent for the 
grade on this test. (British Columbia Ministry of Education, 2001, p. 23) 

For this study, a representative 10 per cent sample of data for each of
grades 4, 7, and 10 from the Year 2000 assessment was obtained.
Students who took the tests in French, ranging from 9 to 25 students for
each grade, were eliminated from the sample because the properties of
the tests may vary for this group. The numbers of students, means, and
standard deviations for each test and other descriptive statistics are
presented in Tables 2 and 3. The FSA tests contained both multiple-choice
and constructed-response items, and the maximum possible scores
ranged from 48 to 56. These tests had moderate to high difficulty levels,
with average scores ranging from 50 to 71 per cent of the maximum
score. Because the tests included both dichotomously and polytomously
scored items, reliabilities were estimated with coefficient alpha. The
reliabilities of the scores ranged from 0.84 to 0.88.

Classification Accuracy 

Two classification accuracy indices were used in the study, po and κ. The
most commonly used measure of classification accuracy is a simple
measure of agreement, po, defined as the total proportion of examinees
who were classified into the same proficiency level according to their
true score and observed score across all possible proficiency levels.
Another commonly used classification accuracy indicator is Cohen's κ
coefficient (Cohen, 1960). This statistic is similar to the proportion
agreement po, except that it is corrected for the agreement that is due to
chance. Neither of these classification indices distinguishes among
different degrees of misclassification, such as misclassifying examinees
by one proficiency level versus two proficiency levels. For the purposes
of this study, all misclassifications are treated as equally important.
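
To make the two indices concrete, the sketch below computes po and Cohen's κ from a cross-classification of true-score level by observed-score level. The function name and the example counts are hypothetical and are not FSA data.

```python
def agreement_indices(table):
    """Return (po, kappa) for a square table of counts.

    table[i][j] = number of examinees whose true-score level is i and
    whose observed-score level is j.
    """
    n_levels = len(table)
    total = sum(sum(row) for row in table)
    # po: proportion of examinees on the diagonal (same level both ways).
    p_o = sum(table[i][i] for i in range(n_levels)) / total
    # Chance agreement: sum over levels of the product of the marginals.
    row_marg = [sum(table[i]) / total for i in range(n_levels)]
    col_marg = [sum(table[i][j] for i in range(n_levels)) / total
                for j in range(n_levels)]
    p_c = sum(r * c for r, c in zip(row_marg, col_marg))
    kappa = (p_o - p_c) / (1.0 - p_c)
    return p_o, kappa


if __name__ == "__main__":
    # Hypothetical counts for three proficiency levels (100 examinees).
    counts = [
        [20, 5, 0],
        [5, 40, 5],
        [0, 5, 20],
    ]
    print(agreement_indices(counts))
```

Because both indices use only the diagonal and the marginals, a misclassification by one level and a misclassification by two levels count the same, as noted above.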

Table 2: Foundation Skills Assessment (FSA) 2000 Sample Data and Tests

Subject     Grade    # Items (Max score)    Sample size

Reading     4        39 (51)                4710
            7        44 (56)                4724
            10       43 (55)                4648
Numeracy    4        36 (48)                4705
            7        36 (48)                4685
            10       36 (48)                4737


Table 3: Descriptive Statistics Based on the Foundation Skills Assessment (FSA) 2000
Sample Data

Subject     Grade    Average      Cut-scores¹           Mean     SD      Coefficient-α
                     % of max.

Reading     4        67           27, 41 (232, 361)     33.86    8.79    0.86
            7        68           31, 48 (225, 425)     37.81    8.59    0.84
            10       71           32, 49 (230, 420)     38.62    9.06    0.87
Numeracy    4        52           17, 39 (239, 465)     25.12    9.73    0.87
            7        58           19, 41 (237, 473)     27.50    9.86    0.87
            10       50           17, 39 (218, 424)     24.10    9.88    0.88

¹ On raw-score scale (on scale-score scale)

Measurement Precision at Cut-score Points

Classification accuracy is closely related to the measurement precision
provided at the cut-score points. To examine the measurement precision
provided at the two cut-score points in the FSA, the six FSA tests were
calibrated using an item response theory (IRT) based approach. The
multiple-choice items were calibrated using the 3-Parameter Logistic
(3PL) model (Lord, 1980) and the constructed-response items were
calibrated using the 2-Parameter Partial Credit (2PPC) model (Yen, 1993).
The estimations were conducted using PARDUX (Burket, 1991). The
standard error of measurement (SEM) for each θ score was computed
based on the item parameter estimates using FLUX (Burket, 1993).

Beta-nomial Procedure 

Analyses focused on estimating classification accuracy for each of the six
FSA tests using Huynh's Beta-nomial procedure and comparing these
estimates to the classification accuracies that would be expected based
on the Ercikan and Julian guidelines. The Beta-nomial procedure uses
the mean and the standard deviation of raw scores, the reliability
estimate, the maximum possible score points, the number of proficiency
levels, and the cut-scores on the raw score scale to produce the
classification accuracy estimates po and κ. Raw scores are defined as
the sum of scores across all items. The reliability was estimated using
Cronbach's α (Cronbach, 1951). Cut-scores are scores that are used for
classifying examinees to different proficiency levels. The Huynh method
assumes that the test scores on each test follow a Beta-nomial model. The
classification accuracy indicators, po and κ, are then computed using the
Beta-nomial distribution. The Beta-nomial procedure was implemented
using DOS-based software developed by Huynh (1979).
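
The general idea of a beta-binomial accuracy calculation can be sketched as follows. This is not Huynh's software and should not be expected to reproduce its output. The sketch assumes that the maximum score points act as n binomial trials, that the true proportion-correct score follows a beta distribution whose parameters are matched to the mean and reliability (under a beta-binomial model, reliability equals n / (a + b + n)), and that true levels are obtained by applying the raw cut-scores to the expected raw score. All function names are hypothetical.

```python
import math


def beta_binomial_params(mean, n_points, reliability):
    """Moment matching: mean = n*a/(a+b), reliability = n/(a+b+n)."""
    a_plus_b = n_points * (1.0 - reliability) / reliability
    a = (mean / n_points) * a_plus_b
    return a, a_plus_b - a


def classification_accuracy(mean, n_points, reliability, cuts, grid=2000):
    """Approximate (po, kappa) by midpoint integration over true scores."""
    a, b = beta_binomial_params(mean, n_points, reliability)
    n_levels = len(cuts) + 1
    # Observed level l covers raw scores bounds[l] <= x < bounds[l + 1].
    bounds = [0] + list(cuts) + [n_points + 1]
    joint = [[0.0] * n_levels for _ in range(n_levels)]
    for i in range(grid):
        tau = (i + 0.5) / grid  # true proportion score in (0, 1)
        weight = tau ** (a - 1) * (1 - tau) ** (b - 1)  # unnormalized beta
        pmf = [math.comb(n_points, x) * tau ** x * (1 - tau) ** (n_points - x)
               for x in range(n_points + 1)]
        true_level = sum(tau * n_points >= c for c in cuts)
        for obs in range(n_levels):
            joint[true_level][obs] += weight * sum(pmf[bounds[obs]:bounds[obs + 1]])
    total = sum(map(sum, joint))
    joint = [[v / total for v in row] for row in joint]
    p_o = sum(joint[l][l] for l in range(n_levels))
    p_c = sum(sum(joint[l]) * sum(row[l] for row in joint)
              for l in range(n_levels))
    return p_o, (p_o - p_c) / (1.0 - p_c)


if __name__ == "__main__":
    # Inputs taken from Table 3 for the reading grade-4 test.
    print(classification_accuracy(33.86, 51, 0.86, [27, 41]))
```

The inputs mirror those listed above (mean, reliability, maximum score points, and raw cut-scores); the standard deviation is not needed in this simplified version because the reliability and mean already determine both beta parameters.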

RESULTS 

In this study, Ercikan used the Beta-nomial procedure (Huynh, 1979),
which can be used to estimate classification accuracy based on a single
test administration, to examine the reasonableness of the guidelines
provided in Ercikan and Julian (2002). In addition, classification accuracy
in the British Columbia Foundation Skills Assessment (FSA) for grades 4,
7, and 10 on reading and numeracy was examined. The sections below
describe the results of the analyses investigating properties of the FSA
tests and the classification accuracy of proficiency level scores from these
tests, and compare estimates of classification accuracies to those that
would be expected based on the Ercikan and Julian guidelines.

Cut-scores 

Two cut-scores are associated with the three proficiency levels reported
by the FSA. The cut-scores were originally set on the raw-score scale by
the British Columbia Ministry of Education. However, analyses included
IRT calibrations that allowed examining measurement precision at
different score points. Therefore, the description here includes both the
cut-scores in terms of raw score points as well as scale scores and
measurement precision based on IRT calibrations.

After calibration using the 3PL and 2PPC models, for dichotomous
and polytomous items, respectively, scales were created for each test.
The θ scale was transformed to range between 0 and 600 by multiplying
θ scores that ranged from -4.00 to +4.00 by 75, the desired standard
deviation, and adding 300, the desired mean. The scale scores that
corresponded to the cut-scores on the raw score scale were determined
by using the test characteristic curves that map raw scores onto the θ
scale, which in turn can be converted to a scale score. One factor that
affects classification accuracy is the distance between the cut-scores.
When cut-scores are closer to each other, the classification accuracy is
expected to be lower. As can be seen in Table 3, the difference between
the two cut-scores ranged from 14 (for reading grade 4) to 22 raw score
points (for all numeracy tests). The shortest distance between the
cut-scores on the FSA tests corresponded to 1.6 standard deviations of raw
scores on the reading grade-4 test and the largest was approximately 2.2
standard deviations of raw scores on the three numeracy tests. The scale
score cut-score differences ranged from 129 (for reading grade 4) to 236
(for numeracy grade 7).
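
The two calculations in this section, the linear θ-to-scale-score transformation and the distance between cut-scores in standard deviation units, are simple enough to check directly. The function names below are illustrative; the constants 75 and 300 and the cut-scores and standard deviations come from the text and Table 3.

```python
def theta_to_scale(theta, sd=75.0, mean=300.0):
    """Linear transformation of a theta score to the 0-600 reporting scale."""
    return sd * theta + mean


def cut_distance_in_sd(cut_low, cut_high, raw_sd):
    """Distance between two raw cut-scores in raw-score SD units."""
    return (cut_high - cut_low) / raw_sd


if __name__ == "__main__":
    print(theta_to_scale(-4.0), theta_to_scale(4.0))      # ends of the scale
    print(round(cut_distance_in_sd(27, 41, 8.79), 2))     # reading grade 4
    print(round(cut_distance_in_sd(19, 41, 9.86), 2))     # numeracy grade 7
```

The mapping from a raw cut-score to θ itself requires the test characteristic curve from the IRT calibration, which is not reproduced here.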

Measurement Precision at Cut-score Points 

The SEM was calculated for each scale score point. Based on the IRT
methodology, the SEM is on the θ scale. Using the test characteristic
curves, the scale scores and their corresponding SEM values on the θ
scale for each cut-score point were determined and are presented in
Table 4. The SEM values at the first cut-score point were similar for
reading and numeracy tests. However, the SEM values at the second
cut-score were considerably larger for reading tests than numeracy tests. In
addition, the second set of cut-scores was on parts of the scale where
measurement precision was lower for all tests except for the numeracy
grade-10 test.

Table 4: Standard Error of Measurement at Cut-score Points

Subject     Grade    Cut-score 1    Cut-score 2

Reading     4        25             40
            7        27             54
            10       24             54
Numeracy    4        27             34
            7        30             35
            10       30             25



Expected Classification Accuracy for the FSA based on the Ercikan and Julian 
Guidelines 

The Ercikan and Julian guidelines require two types of information
about the assessment to determine the expected classification accuracy
levels: the reliability of the tests and the desired number of proficiency
levels. The FSA reported individual student performances as well as
group level performances using three proficiency levels (Not Yet Within
Expectations, Meeting Expectations, and Exceeding Expectations). The
reliability as estimated by coefficient α ranged from 0.84 to 0.88. Using
the reliability estimates and the number of proficiency levels in the FSA
tests, the Ercikan and Julian guidelines were used to determine the
expected classification accuracy levels for the six tests. Given that the six
FSA tests had similar reliabilities ranging from 0.84 to 0.88, the expected
classification accuracy ranges were determined to be the same for all six
tests. These expected ranges of classification accuracy are presented in
Table 5. The expected classification accuracy po ranged from 0.80 to 0.90
and the expected classification accuracy κ ranged from 0.65 to 0.75. The
expected classification accuracy has a wide range because of variability
in where the cut-scores are placed on the score scale and the
measurement precision associated with these cut-scores.

Table 5: Verification of Classification Accuracy Based on the Foundation Skills
Assessment (FSA) 2000 Sample Data

Subject     Grade    Expected¹ po (κ)             Estimated po (κ)    Adjusted po (κ)

Reading     4        0.80 - 0.90 (0.65-0.75)      0.75 (0.59)         0.77 (0.69)
            7        0.80 - 0.90 (0.65-0.75)      0.78 (0.56)         0.80 (0.66)
            10       0.80 - 0.90 (0.65-0.75)      0.79 (0.60)         0.81 (0.70)
Numeracy    4        0.80 - 0.90 (0.65-0.75)      0.83 (0.63)         0.85 (0.73)
            7        0.80 - 0.90 (0.65-0.75)      0.84 (0.64)         0.86 (0.74)
            10       0.80 - 0.90 (0.65-0.75)      0.83 (0.65)         0.85 (0.75)

¹ Based on Ercikan and Julian guidelines

Classification Accuracy Estimates for FSA tests

For the Beta-nomial model, the main assumption that scores be
distributed Beta-nomially was checked by visual examination of the
graphical display of the distribution of scores from the six FSA tests.
These distributions indicated that the Beta-nomial distribution would be
a reasonable assumption. The Beta-nomial procedure was applied to the
six FSA tests and classification accuracies were estimated. The results are
presented in Table 5. The estimated classification accuracy, po, ranged
from 0.75 (for the reading grade-4 test) to 0.84 (for the numeracy grade-7
test). The κ estimates ranged from 0.56 (for the reading grade-7 test) to
0.65 (for the numeracy grade-10 test). Previous research on the
Beta-nomial classification accuracy estimates indicated that po estimates had
-2% bias, and the κ estimates had -10% bias (Huynh & Saunders, 1980).
These bias estimates mean that using the Beta-nomial procedure, on
average, po would be estimated to be two per cent less than it actually is,
and κ would be estimated to be 10 per cent less than it actually is.
Therefore, the classification estimates were corrected for these biases to
get more accurate estimates. For example, the estimated po for reading
grade 4 increased from 0.75 to 0.77 after an adjustment for -2% bias. An
adjustment for the -10% bias on estimated κ for reading grade 4
increased the estimate from 0.59 to 0.69. The adjusted estimates for po
and κ are presented in Table 5. These adjusted estimates ranged from
0.77 to 0.86 for po, and from 0.66 to 0.75 for κ. The reading
tests had consistently lower classification accuracies than the numeracy
tests. Although measurement precision at the first cut-score point was
similar for all tests, numeracy tests had higher measurement precision at
the second cut-score and greater distances between the cut-scores,
which may have led to the higher classification accuracies for
these tests.

Comparison of Estimated versus Expected Classification Accuracies 

The expected classification accuracy ranges based on the Ercikan and
Julian guidelines were compared to the estimated classification
accuracies. The estimated po was within the range of expected values for
the numeracy tests but lower than the expected range for the reading
tests. Estimated κ, on the other hand, was at or below the lower bound of
the range predicted by the guidelines for all tests. When the estimates were
adjusted for the expected negative bias (-2% for po and -10% for κ),
the adjusted po and κ fell within the range of expected po and
κ for all tests except the grade-4 reading test. For this test, even
though the adjusted κ fell within the range of expected κ values, the
adjusted po value was approximately 3 per cent less than the lower bound
of the range of expected po.
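
The adjustment and comparison can be retraced with a few lines of code. The sketch below uses the estimated values and expected ranges reported in Table 5 and treats the biases as additive on the index scale (0.02 for po and 0.10 for κ), which is how the adjusted values in Table 5 relate to the estimated ones. The names are hypothetical.

```python
# Estimated (po, kappa) from Table 5, keyed by (subject, grade).
ESTIMATED = {
    ("Reading", 4): (0.75, 0.59),
    ("Reading", 7): (0.78, 0.56),
    ("Reading", 10): (0.79, 0.60),
    ("Numeracy", 4): (0.83, 0.63),
    ("Numeracy", 7): (0.84, 0.64),
    ("Numeracy", 10): (0.83, 0.65),
}
EXPECTED_PO = (0.80, 0.90)      # expected range from the guidelines
EXPECTED_KAPPA = (0.65, 0.75)


def adjust(p_o, kappa, po_bias=-0.02, kappa_bias=-0.10):
    """Remove an additive (negative) bias from each estimate."""
    return round(p_o - po_bias, 2), round(kappa - kappa_bias, 2)


def within(value, bounds):
    """True when value lies inside the closed interval bounds."""
    return bounds[0] <= value <= bounds[1]


def tests_outside_expected():
    """Tests whose adjusted po or kappa falls outside the expected range."""
    flagged = []
    for test, (p_o, kappa) in ESTIMATED.items():
        adj_po, adj_kappa = adjust(p_o, kappa)
        if not (within(adj_po, EXPECTED_PO) and within(adj_kappa, EXPECTED_KAPPA)):
            flagged.append(test)
    return flagged


if __name__ == "__main__":
    print(tests_outside_expected())
```

Only the reading grade-4 test is flagged, because its adjusted po of 0.77 falls below the lower bound of 0.80.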

SUMMARY AND DISCUSSION 

In this article, the author summarized the guidelines provided in
Ercikan and Julian (2002) regarding the classification accuracy of
proficiency levels and examined the accuracy of these guidelines. The
accuracy of the guidelines was evaluated by comparing the expected
classification accuracy based on these guidelines to the estimated
classification accuracy using Huynh's Beta-nomial classification accuracy
estimation procedure. These comparisons were conducted using the FSA
2000 assessments as examples. The results of the estimation procedure
showed that the FSA assessments had moderate classification accuracy
levels, with po ranging between 0.77 and 0.86.

The classification accuracies estimated using the statistical
procedure were all within the expected range of classification accuracies
based on the guidelines. The only exception was the estimated po for the
reading grade-4 test, which was 3 per cent less
than the lower bound of the range of the expected po. The small
inconsistency between the expected and estimated classification accuracy
po for this test may be due to potential error in the guidelines, because
they do not take the distance between cut-scores and measurement
precision into account, as well as possible bias greater than -2 per cent in
the statistical estimation procedure. Overall, the findings indicate that
the guidelines provided by Ercikan and Julian (2002) are a reasonable rule
of thumb to follow at the planning stage of an assessment design, when
test developers do not have the data needed to estimate these classification
accuracies.

The Ercikan and Julian guidelines are expected to inform decisions
about the number of proficiency levels to use in an assessment, the expected
level of classification accuracy for an assessment with a predetermined
number of proficiency levels, and the test length for a desired level of
classification accuracy and number of proficiency levels. To determine
the number of proficiency levels for an assessment, assessment
developers need first to decide the minimum classification accuracy that
would be acceptable for the consumers of the assessment results, such as
educators and policy makers. Deciding on a level of classification
accuracy is not the same as deciding on an appropriate level of reliability
for a test. The developers need to consider the acceptable level of
misclassifications, both false-positive and false-negative, and the costs
associated with such misclassifications. For example, when classification
accuracy is expected to be 0.80-0.90, assessment developers need to
consider the implications of misclassifying as many as 20 per cent of
students into a wrong proficiency level, as well as the effect on decisions
such as resource allocation and remediation programs. The desirable
classification accuracy level can be combined with the information about
the reliability of the test to determine the number of proficiency levels
based on the guidelines. Once the number of proficiency levels is
determined, where the cut-scores are established on the score scale and how
far apart they are will affect the actual classification accuracy of the
proficiency level scores. To achieve optimal levels of classification
accuracy, the cut-scores should be established at points of the score scale
where measurement precision is maximized. They should also be set as far
apart on the score scale as possible, in addition to considerations given to
criteria that may include behavioural expectations regarding
performance on different parts of the scale.

Similarly, to determine the expected level of classification accuracy
for an assessment with a predetermined number of proficiency levels,
the main information needed is the measurement precision provided by
the test. However, the further apart the cut-scores are from each other,
the higher the likelihood that the expected classification accuracy level
will be close to the actual classification accuracy.

The Ercikan and Julian guidelines also provide information about
the number of test items needed for a desired level of classification
accuracy and number of proficiency levels. It is important to highlight
that test items that contribute to measurement precision on parts of the
scale that are likely to have the cut-scores should be prioritized in
constructing tests. On parts of the scale with high levels of measurement
precision, examinees at different ability levels are better discriminated
and, therefore, are less likely to be misclassified.

The Ercikan and Julian guidelines were evaluated based on a
classification estimation procedure that itself has some error associated
with it. The classification accuracy of examinees, understood as consistency
of classifications, can be examined more validly using two test
administrations of parallel tests or the same test. The next step in
evaluating the Ercikan and Julian guidelines should focus on comparing
the guidelines to classification consistency based on two test
administrations.